Milestone 1: Proposal

Group 5:

Abby Ross | Joseph Distler | Nathan Dierkes | Viraj Vilas Rane | Xinyue Chen | Yinkai Xiong

Introduction

Business Problem:

Airbnb is a company that provides a platform for users classified as homeowners to rent out rooms, or the entire house, to Airbnb users classified as renters. This is primarily done in high-tourism locations, but it is also fairly prevalent in most large US cities. Airbnb would like to increase its user base by identifying potential hosts and convincing them to use their property as a rental location. Airbnb receives a percentage of the profit that the rentals on its platform create; therefore, it is in its best interest to grow the number of rental properties available to its potential customers. The company has already gathered a large amount of data on its current listings and has come to our group to analyze said data. It would like to provide the hosts with suggestions for their property descriptions, which amenities to offer, etc., and use these values to predict the ratings, listing prices, booking percentage of available dates, and more. To pilot this idea, the company would like to focus on one city in the United States before rolling this predictive model out to other locations. The dataset to be used in this analysis will be from Inside Airbnb, specifically the Chicago, Illinois datasets. The company would like to use this analysis to determine the ideal property type, location, and amenities to look for in potential rental locations, and then predict how the host’s actions (descriptions, response rate, etc.) would affect the listing’s potential.

Scenarios:

Price prediction

The hosts want to lease their properties on Airbnb, but they are not sure what prices they should set for their new properties (of differing types). They want to build and use a statistical model to predict the Airbnb rental trend for next year. As the economy grows every year, Airbnb is looking for a new price range for its new properties. The hosts can use this price reference to determine whether they should list their properties on Airbnb. Airbnb can use this model to shape its marketing plan to recruit its target partners (hosts).

Customer satisfaction prediction

Customers are looking for rental properties that fit their budget relative to the amenities offered at each location. Therefore, customers are looking for a reliable feedback and rating system that aligns with what they are willing to pay to stay there. Airbnb would use this model to offer customers the right properties, within their price range, to increase the chance of them using its platform. This model will be based on the feedback that the customer provides after visiting a particular property, and Airbnb will then use this data to improve its recommendation system for future customers who look at the property.

Recommendation system

As the customers are clicking on the property that they are interested in, Airbnb also offers the customers similar properties they might like. This allows the customers to explore more options related to their search criteria and increases the chances for the customer to go for the options suggested by Airbnb.

The result should help Airbnb to answer multiple questions regarding their future operations, including:

What will the price for a new property be next year (rise or fall)? (price prediction)

Observations that could help us to predict property price next year:

What rating score would customers give to a new property? (customer satisfaction prediction)

Observations that could help us to predict the satisfaction of customers for new properties:

What types of housing should the company recommend to customers after they click on one specific listing? (recommendation system)

Observations that could help us to recommend housing to guests:

Data Source and Collection

Data Source

The main data is sourced from the Inside Airbnb website at: http://insideairbnb.com/get-the-data.html. This data was compiled by Inside Airbnb from publicly available Airbnb listing information and posted online for use by anyone.

City of Chicago latitude and longitude sourced from: https://www.latlong.net/place/chicago-il-usa-1855.html

Key Information

Key information in the dataset includes, but is not limited to:

Data Manipulation

In order to start the data analysis, we will need to import a variety of packages.

We will read in the data, which was downloaded from the Airbnb website (InsideAirbnb) and look at its info.

We can see that there are 74 columns and 6,366 observations; however, some values are missing in various columns.

Data Cleansing

Drop irrelevant columns:

Let us check for duplicate values and columns.

We can see that the dataframe did not include any duplicate observations.

Bathrooms Column:

Looking at the 'bathrooms_text' column, we see that it is not very usable in its current state. We will split the bathroom text column into two: one containing a float variable for the number of bathrooms, and the other an additional descriptor of the bathroom (shared/private).

Above are the unique values left for the float variable in the column 'bathrooms'. The text portion requires a little more refining:

This leaves us with only 1,587 observations that contain one of the bathroom descriptors, 'shared' or 'private'; the rest are missing values since the original data did not contain text for them.
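
The split described above can be sketched as follows. This is a minimal illustration on made-up sample values, not the original notebook code: the descriptor column name 'bathrooms_type' is hypothetical, and treating "half-bath" entries as 0.5 bathrooms is an assumption.

```python
import pandas as pd

# Made-up sample values for illustration only.
df = pd.DataFrame({
    "bathrooms_text": ["1 bath", "1.5 shared baths", "2 private baths",
                       "Shared half-bath", None]
})

text = df["bathrooms_text"].str.lower()

# Numeric part: the first number found in the text.
number = text.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

# Assumption: entries such as "half-bath" carry no digits and count as 0.5.
is_half = number.isna() & text.str.contains("half", na=False)
df["bathrooms"] = number.mask(is_half, 0.5)

# Descriptor part: 'shared' or 'private'; missing when the text has neither.
# 'bathrooms_type' is a hypothetical column name.
df["bathrooms_type"] = text.str.extract(r"(shared|private)", expand=False)

print(df)
```

Rows without a descriptor keep a missing value in the second column, which matches the observation that only part of the data carries 'shared' or 'private'.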

DateTime Columns:

Next, let's convert the datetime columns into the proper datatype.

In order to see the length of time that a host has been active on the platform, we will create a new column called 'host_age' which we can visualize later. Note: this will also be converted to a float variable instead of a timedelta variable.
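
A minimal sketch of the 'host_age' calculation on sample dates. It assumes the reference date is the 'last_scraped' column and that the age is expressed in years; both are assumptions, since the text does not state which reference date or unit was used.

```python
import pandas as pd

# Sample dates for illustration only.
df = pd.DataFrame({
    "host_since": ["2015-06-01", "2020-01-15"],
    "last_scraped": ["2021-06-01", "2021-06-01"],
})

# Convert the text columns to proper datetime values.
for col in ["host_since", "last_scraped"]:
    df[col] = pd.to_datetime(df[col])

# Timedelta -> number of days -> float years (assumed unit).
df["host_age"] = (df["last_scraped"] - df["host_since"]).dt.days / 365.25

print(df["host_age"])
```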

'Rate' Columns:

We can see that we will need to convert the percentage columns ('host_response_rate' and 'host_acceptance_rate') into float variables.
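
One possible way to do this conversion is shown below on sample values. Storing the rates as fractions between 0 and 1 (rather than 0 to 100) is an assumption.

```python
import pandas as pd

# Sample values for illustration only.
df = pd.DataFrame({
    "host_response_rate": ["100%", "87%", None],
    "host_acceptance_rate": ["95%", None, "0%"],
})

for col in ["host_response_rate", "host_acceptance_rate"]:
    # Strip the trailing percent sign, cast to float, and rescale to 0-1.
    df[col] = df[col].str.rstrip("%").astype(float) / 100

print(df)
```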

Boolean Columns:

We will convert the 't' and 'f' values to binary float values in all of the boolean columns for later analysis, where 1 will mean "True".
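
A short sketch of this mapping on sample data. The two columns listed here are only examples; the full list of boolean columns is not given in the text.

```python
import pandas as pd

# Sample values for illustration only.
df = pd.DataFrame({
    "host_is_superhost": ["t", "f", None],
    "instant_bookable": ["f", "t", "t"],
})

bool_cols = ["host_is_superhost", "instant_bookable"]  # example subset
for col in bool_cols:
    # 't' -> 1.0, 'f' -> 0.0; anything else (including missing) stays NaN.
    df[col] = df[col].map({"t": 1.0, "f": 0.0})

print(df)
```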

Price Column:

Convert the price column to a float data type.
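
A minimal sketch, assuming the raw prices are stored as text with a dollar sign and thousands separators (for example "$1,250.00"):

```python
import pandas as pd

# Sample values for illustration only.
df = pd.DataFrame({"price": ["$85.00", "$1,250.00", "$40.00"]})

# Remove the dollar sign and commas, then cast to float.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)

print(df["price"])
```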

License Column:

Since we will not be able to use the individual license numbers, we will convert this column into a categorical variable, where 1 means the listing has a license, and 0 does not. For our purposes, we will consider a license that is still pending as not having a license (i.e. 0).
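
The rule above could be sketched as follows. The license strings are made-up examples, the column name 'has_license' is hypothetical, and detecting a pending license by the word "pending" is an assumption about how those entries are written.

```python
import pandas as pd

# Made-up license strings for illustration only.
df = pd.DataFrame({
    "license": ["R17000015609", "City registration pending", None, "2209984"]
})

lic = df["license"].fillna("").str.strip().str.lower()

# 1 = has a license; 0 = missing or still pending.
df["has_license"] = ((lic != "") & ~lic.str.contains("pending")).astype(int)

print(df)
```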

Ratings

Let us create an average rating column that includes the average of all review scores values.
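
A sketch of the row-wise average on sample scores. It assumes all review score columns share the same scale, so a plain mean is meaningful; 'avg_rating' is the column name used later in the analysis.

```python
import pandas as pd

score_cols = [
    "review_scores_rating", "review_scores_accuracy",
    "review_scores_cleanliness", "review_scores_checkin",
    "review_scores_communication", "review_scores_location",
    "review_scores_value",
]

# Sample scores for illustration only: one row of 5s, one row with a single 4.
df = pd.DataFrame([[5.0] * 7, [5.0] * 6 + [4.0]], columns=score_cols)

# Row-wise mean across the score columns; missing scores are skipped.
df["avg_rating"] = df[score_cols].mean(axis=1)

print(df["avg_rating"])
```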

Missing Values

For the column 'host_response_time', convert the values to binary and create a new column indicating whether the host responds within a day.

For the columns description, neighborhood_overview, host_location, host_about, and host_neighbourhood, fill the missing values with 'Unknown', because these columns do not have a direct effect on the project topic.

For the columns host_name, host_since, host_has_profile_pic, and host_identity_verified, the rows with missing values contain many NaN values, as can be seen below, so we drop those rows directly.

For the columns bedrooms and beds, fill the missing values with the mode, because filling with a naturally occurring value makes the result less biased.

For the columns first_review, last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, and reviews_per_month, most missing values are caused by 'number_of_reviews' == 0, so we fill those missing values with 0.0. The later review analysis will exclude these rows since they have no reviews.
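
Three of the fill strategies described in this section can be sketched on a small sample frame. Only a subset of the columns is shown, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample rows for illustration only.
df = pd.DataFrame({
    "host_about": ["Hi!", None, "Local guide"],
    "bedrooms": [1.0, np.nan, 1.0],
    "beds": [2.0, 2.0, np.nan],
    "number_of_reviews": [12, 0, 3],
    "review_scores_rating": [4.8, np.nan, 4.5],
    "reviews_per_month": [1.2, np.nan, 0.4],
})

# Free-text columns: fill with the placeholder 'Unknown'.
text_cols = ["host_about"]
df[text_cols] = df[text_cols].fillna("Unknown")

# bedrooms / beds: fill with the most frequent value (mode).
for col in ["bedrooms", "beds"]:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Review columns: fill with 0.0 only where the listing has no reviews.
review_cols = ["review_scores_rating", "reviews_per_month"]
no_reviews = df["number_of_reviews"] == 0
df.loc[no_reviews, review_cols] = df.loc[no_reviews, review_cols].fillna(0.0)

print(df)
```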

Data Exploration and Visualization

Let us explore the data in the dataset.

Unique Hosts

First, how many unique hosts are there?

How many listings does each host have in the Chicago area?

Here we can see that out of the 3,371 unique hosts, 799 have more than one listing in the Chicago area. Interestingly, there is one host id with 260 listings.

Let's graph this data to see the distribution of hosts with differing numbers of listings.

It is obvious that a vast majority of hosts have only one listing in the Chicago area. Let us look at the outliers.

Listing License

A majority of the listings have a license, but it is not a large majority.

Host Response Time

Here we can see there are four categories for the response time. Let's define a system for rating the response time by using floating numbers. We will assign the values in hours and as follows:

Host Verifications

In order to simplify future analysis, let us count the number of verifications the host has and list this in a new column.

We can see that the verifications are separated by a comma, so we will use this to count the number of verifications each host has.

In order to catch any observations where hosts have no verifications, we will set the number of verifications to zero where host_verifications = 'None'. This is important since the code above would have counted both 'None' and an observation without a comma (i.e. only one verification) as 1.
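
A minimal sketch of the comma-counting approach, including the correction for 'None'. The sample strings and the column name 'n_verifications' are hypothetical.

```python
import pandas as pd

# Made-up sample values for illustration only.
df = pd.DataFrame({
    "host_verifications": ["['email', 'phone', 'reviews']", "['phone']", "None"]
})

# Number of verifications = number of commas + 1.
df["n_verifications"] = df["host_verifications"].str.count(",") + 1

# 'None' contains no comma and would otherwise be counted as 1.
df.loc[df["host_verifications"] == "None", "n_verifications"] = 0

print(df)
```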

On average, hosts have about 5-6 different identity verifications.

Amenities

Here we repeat the same process for number of amenities listed.

Let us take a closer look to see what outliers are in this variable.

Distance from Center of Chicago

Let's calculate the distance of the listings from the center of the city of Chicago. We will use the following coordinates: 41.8781° N, 87.6298° W.

We will use the Haversine formula to calculate the distance in miles. In order to do so, we will first define a function to perform the calculation.

Then, apply the formula to each observation in the data set, returning the answer in a new column for the distance from the center of the city.
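
The two steps above can be sketched as follows. The Earth radius of 3958.8 miles and the column name 'distance_from_center' are assumptions; the city-center coordinates are the ones stated above, and the two sample listings are made up.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8                    # assumed mean Earth radius
CHICAGO_LAT, CHICAGO_LON = 41.8781, -87.6298   # center of Chicago (from the text)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# Made-up listings: one at the center, one exactly 1 degree of latitude north.
df = pd.DataFrame({
    "latitude": [41.8781, 42.8781],
    "longitude": [-87.6298, -87.6298],
})

# The function works on whole columns, so no row-by-row loop is needed.
df["distance_from_center"] = haversine_miles(
    CHICAGO_LAT, CHICAGO_LON, df["latitude"], df["longitude"]
)

print(df)
```

One degree of latitude is roughly 69 miles, which gives a quick sanity check on the second row.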

The plot shows a skew towards the center of the city, which makes sense, since a larger number of listings would be closer to the center.

Property and Room Types

Additional Box Plots

Price Outliers

Regression Plots

Price v. Distance from center of city

The regression plot shows a correlation between the distance from the center of the city and the price of the listing. In general, the listings closer to the center of the city are worth more than the ones further away. This makes sense, as property values are generally higher in more populated areas.

Regression Analysis and Prediction of Price based on distance from the city

We can use these data fields to make a prediction of price based on the distance from the city.

Based on the above OLS model, we can say that on average a property that is 1 mile further from the center of the city than another would be $12.73/night cheaper.

Price v. Number of Amenities

On average, the listings with more amenities listed are also posted for a higher rental price.

Regression Analysis and Prediction of Price based on Amenities

Again, we can use this data to develop a regression model to predict a property's price based on the number of amenities that are listed in the field.

Here we can use this model to make predictions based on new properties. For example, if a property were to have 35 amenities listed we can predict the listing price to be about (1.76 x 35) + 104.40 = $166/night.
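
The worked example can be wrapped in a small helper. The slope and intercept are the values reported above; the function name is hypothetical.

```python
# Coefficients reported by the regression of price on number of amenities.
SLOPE = 1.76        # dollars per additional amenity
INTERCEPT = 104.40  # predicted price with zero amenities listed


def predict_price(n_amenities):
    """Predicted nightly price from the simple linear model."""
    return SLOPE * n_amenities + INTERCEPT


print(round(predict_price(35), 2))
```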

Average Rating v. Response Time

The regression plot shows a negative trend in the average rating related to the time it takes the host to respond to messages and booking requests.

Average Rating v. Response Rate

In general, the higher the hosts' response rate is, the higher their average listing rating is.

Average Rating v. Number of Verifications

While it appears there is a positive correlation between the average rating and the number of verifications a host has, it is very small.

Word Clouds

Some of the columns are free text entered by the hosts. We will look at the word clouds to explore which keywords are used most frequently. The data fields we will focus on are:

First, let's define a function to make it easy to generate our wordclouds from the input column:

Listing Description

In the description field, we see a lot of keywords show up that would make sense to be in the field. Interestingly, the word "vaccinated" has already become one of the most used words.

Neighborhood Overview

In the neighborhood overview, there are once again a lot of words that you would expect to show up when describing a location. "Ukrainian" refers to Ukrainian Village, a neighborhood within Chicago, and "Lasalle" is likewise a local place name.

About the Host

Once again here we see common words to describe hosts, along with some of the more popular names in the area.

Joint Plots

Distance from the center of the city vs. Average rating

According to the above plot, customers tend to give places that are close to the center of Chicago a rating between 4 and 5.

Accommodates vs. Average rating

Smaller accommodations tend to have higher average rating scores.

Accommodation vs. Availability for 30 days

According to the above plot, there is a high density starting in the bottom left corner and spreading towards the top and right. We can see that smaller accommodations have a lower chance of being available in the next 30 days.

Accommodation vs. Distance from center of Chicago

We can see that smaller accommodations tend to stay close to the city.

Heat Map

Geospatial visualization

Chicago covers a widespread area with downtown and suburban locations, and property prices tend to rise as we move towards downtown. We will consider the following variables for observing how prices vary across locations:

We take the 'latitude' and 'longitude' variables for identifying the property location hotspots, and 'price' variable to define the intensity of high and low priced rentals.

The Coordinates channel uses the location field, which contains arrays of latitude-longitude pairs.

The Intensity field uses the price field, which contains rental price for each property.

Dimension Reduction

Principal Component Analysis

In order to complete a Principal Component Analysis, we need to only select the numeric (non-datetime) values and drop values with NaNs.

Next we will scale the data.

PCA with all variables

For the first PCA, we will include all of the variables (i.e. all columns from the selection above).

PCA with 80% variance explained

Taking the first 17 principal components will correspond to 80% of the variance explained.

From the above correlation table, we can see that the pairwise correlations between components are close to zero. This means that the components are orthogonal (not correlated), so there is no multicollinearity among the principal components.
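
A sketch of this procedure with scikit-learn on synthetic data. The data, seed, and dimensions are made up, so the number of components kept here will not match the 17 found on the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: 10 correlated variables driven by 3 factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.5 * rng.normal(size=(200, 10))

# Scale first so that no variable dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components reaching that variance share.
pca = PCA(n_components=0.80)
scores = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

# Correlations between component scores should be (numerically) zero.
corr = np.atleast_2d(np.corrcoef(scores, rowvar=False))
max_off_diag = np.abs(corr - np.eye(corr.shape[0])).max()

print(pca.n_components_, round(explained, 3), max_off_diag)
```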

In order to graph the results, we will restrict the PCA analysis to two components.

PCA with 2 components

It appears that there are three distinct groups in the graph, which are determined by the PCA2 values.

We can look at the loadings used to calculate the components from the original variables:

Predictive Methods

In this section, we will look at different predictive models for various target variables.

K-Nearest Neighbors - Rating Prediction

We will use KNN modeling to build a model to predict the average rating of new listings based on the other predictors. First we need to load the appropriate packages.

Next we will divide the variables into the independent variables (X) and the dependent, or target, variable (y). In this model, we will be attempting to predict the average rating ('avg_rating').

We also already selected the appropriate columns for the PCA analysis earlier, and can start with the same pre-normalized data.

Since the other rating predictors would not be available for future data, these should be dropped from the predictors as well.

In order to test the performance of our model, we need to split the original dataset into test and train subsets. We will use a 20/80 test/train split.

Above we can see that the split is fairly equally distributed.

Next we will scale the data for the analysis.

Now we will iterate through various values of k in order to tune the model.
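
The split, scaling, and tuning steps can be sketched as follows. This uses synthetic data and a hypothetical range of k from 1 to 20; it is an illustration of the approach, not the original notebook code.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic predictors and rating-like target for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 4.5 + 0.2 * X[:, 0] - 0.1 * X[:, 1] + 0.1 * rng.normal(size=300)

# 20/80 test/train split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both subsets.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Try each k and record the root mean squared error on the test subset.
rmse_by_k = {}
for k in range(1, 21):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train_s, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test_s))
    rmse_by_k[k] = float(np.sqrt(mse))

best_k = min(rmse_by_k, key=rmse_by_k.get)
print(best_k, round(rmse_by_k[best_k], 3))
```

Choosing k on the test subset can make the reported error optimistic; cross-validation on the training subset is a common alternative.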

Random Forest Classifier - Predicting the Most Preferred Room Type for Customers

We will use RandomForestClassifier to predict the most preferred room type based on customer satisfaction (ratings and reviews).

We can see that there are some missing values in the predictors, so we fill those missing values.

We now split the data into 80% train and 20% test datasets via the train_test_split() method.

We scale the training and testing data to obtain optimal results from the RF model.

From the feature_importances_ attribute above, we can see that avg_rating, reviews_per_month, and number_of_reviews are the variables of highest importance in training the RF classifier.

The RF classifier without hyperparameter tuning gives an accuracy score of 70.44%. Now let's tune the hyperparameters and observe the difference in predicting the most preferred room type based on ratings and reviews.

The RF classifier with hyperparameter tuning gives us an accuracy score of 72.76% with the best hyperparameters found.

From the above model result, we can see that the preferred room type based on customer ratings and reviews is "Entire home/apt", followed by "Private room", "Hotel room", and lastly "Shared room".
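
A sketch of comparing an untuned and a tuned random forest. It uses synthetic data and a small hypothetical parameter grid searched with GridSearchCV, which is one possible tuning method; the text does not say which method or grid was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic multi-class data standing in for the room-type problem.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Baseline: default hyperparameters.
baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Tuned: cross-validated search over a small, hypothetical grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)
tuned_acc = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print(baseline_acc, tuned_acc, search.best_params_)
```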


K-Nearest Neighbors - Price Prediction

We will use KNN modeling to build a model to predict the price of new listings based on the other predictors.

Preprocess Data

Data Partition

We use the sklearn.model_selection.train_test_split() method to split the dataset into test and training sets. Major parameters are:

Normalize Data

Because k-NN needs to calculate the distance between observations, it is better to normalize the data, as we have variables measured on different scales.

Tune the k-NN Regressor

The choice of the parameter value k has an impact on the performance of the k-NN algorithm. In the following, we tune the k parameter based on Root Mean Squared Error and R2.

Decision Tree Prediction on Customer Satisfaction

Assign the columns that might affect customer satisfaction to X.

Replace all values less than 4 with 0 and all values higher than 4 with 1 in the "review_scores_rating" column. Then assign it to y.
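
A minimal sketch of building the binary target on sample scores. Treating a rating of exactly 4.0 as satisfied is an assumption here, chosen to match the "equal to & higher than 4.0" wording in the summary of findings.

```python
import pandas as pd

# Sample scores for illustration only.
df = pd.DataFrame({"review_scores_rating": [4.9, 3.5, 4.0, 2.0]})

# 1 = satisfied (rating of 4 or higher, see assumption above), 0 = not satisfied.
y = (df["review_scores_rating"] >= 4).astype(int)

print(y.tolist())
```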

Data Partition

Split both X and y into train and test sets to check the model performance.

Decision Tree Classifier

The Decision Tree classifier gives an AUC score of 0.97 and an accuracy of 0.98.

Based on the columns that might affect the customer rating, this model predicts whether a customer would rate a property higher than 4.0 (which means the customer is satisfied with that property).

Decision Tree Model - Customer Group Prediction

Use the Decision Tree Classifier to classify which type of customer group is Airbnb's ideal customer in Chicago.

0 is a small group of 1 to 4 people, and 1 is a big group of more than 4 people.

The AUC score is 0.84 and the accuracy score is 0.86. That means that 86 percent of observations are correctly predicted.

Hyperparameter Tuning

After hyperparameter tuning, the performance of the model improves: the accuracy score increases from 86 percent to 90 percent.

Support Vector Machine - Superhost Prediction

Use a Support Vector Machine to predict whether a host is likely to be a superhost based on their total listings, response rate, and acceptance rate. This could be useful because Airbnb doesn’t specify the exact criteria for getting into the program.

Drop the missing values.

0 = Not Superhost, 1 = Superhost

Summary of Findings

Throughout our analysis, we consistently found that the data supported most common understandings of how the variables relate to one another. For example, listings that were closer to the center of the city were, on average, more expensive, accommodated fewer people, and had higher ratings. We were even able to build simple regression models to predict one variable based on another.

Towards the end of the analysis, we were able to develop more complicated regression and classification models based on machine learning algorithms. The more robust models are designed to predict variables based on a larger number of other indicator (or independent) variables. However, given the high dimensionality of the models, their uses may be limited.

One model goal was to predict customer ratings for a new property based on the other predictors. This was done using a K-Nearest Neighbors model with hyperparameter tuning performed on the number of neighbors, 'k'. Using the root mean squared error as the evaluation method, the ideal 'k' value was found to be 2, with an RMSE of 0.383. The model was trained using an 80% split from the original data, with the remaining 20% used for testing in a 5-fold cross validation.

Using a Random Forest Classifier, we were able to predict the room type most preferred by customers. The analysis is based on customer ratings and reviews, with independent variables like number_of_reviews, reviews_per_month, avg_rating, number_of_reviews_ltm (last twelve months), etc. We used scaled variables for training the model and obtained an accuracy score of 72%. The model predictions enabled us to identify and rank the room types that have the highest probability of being chosen by customers.

Using the decision tree classifier to classify the accommodations into two customer groups, big groups of more than 4 people and small groups of 4 or fewer people, we are able to predict the ideal group size of customers using Airbnb in Chicago. The accuracy score of the final model is 90 percent, which means that 90% of observations are correctly labeled. Smaller groups of customers use Airbnb in Chicago more than bigger groups do. Airbnb can use this analysis to identify its ideal target customers in Chicago and to improve more properties that can serve this group.

Using a KNN regressor, we are able to predict the price based on the property details that a customer enters. The analysis is based on property details, customer rating, and location. We used Root Mean Squared Error and R2 to evaluate the performance of each k, and found the best performance at k = 6. The model was trained using an 80% split and tested using the remaining 20% of the original dataset. We also set the random state to make sure that we get the same result every time we run it.

Using the decision tree classifier, we classified the customer rating into two categories: a review rating equal to or higher than 4.0, and a review rating lower than 4.0. We split the dataset into train and test datasets on an 80%-20% basis. The predictors include values that might affect customers' ratings, such as the other detailed ratings, room type, and host-related fields. This model is able to predict the future rating based on those input values. The final model achieved an AUC score of 0.97 and an accuracy of 0.98.

We did not have much success using a Support Vector Machine to predict whether a host is a superhost. It seems that host listings count, host response rate, and host acceptance rate are not great predictors of superhost status. We thought that this might be a useful thing to predict because Airbnb does not give clear guidelines for achieving the status. Because of the low accuracy and AUC scores, we believe that Airbnb must use additional factors to determine whether or not a host qualifies for the superhost program.